Executing data-parallel iterative algorithms on large datasets is crucial for many advanced analytical applications in the fields of data mining and machine learning. Current systems for executing iterative tasks in large clusters typically achieve fault tolerance through rollback recovery. The principle behind this pessimistic approach is to periodically checkpoint the algorithm state. Upon failure, the system restores a consistent state from a previously written checkpoint and resumes execution from that point. We propose an optimistic recovery mechanism using algorithmic compensations. Our method leverages the robust, self-correcting nature of a large class of fixpoint algorithms used in data mining and machine learning, which converg...
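The contrast between rollback recovery and compensation-based recovery can be illustrated with a small sketch. The example below is an assumption-laden illustration, not the authors' implementation: it uses PageRank as a stand-in fixpoint algorithm, simulates the loss of part of the state at one iteration, and applies a hypothetical `compensate` function that refills the lost entries with a uniform value and renormalizes, instead of reading a checkpoint. The names `pagerank_step`, `compensate`, `run`, and `GRAPH` are all invented for this sketch.

```python
def pagerank_step(ranks, out_links, damping=0.85):
    """One synchronous PageRank update over all vertices."""
    n = len(ranks)
    new_ranks = [(1.0 - damping) / n] * n
    for src, targets in out_links.items():
        share = damping * ranks[src] / len(targets)
        for dst in targets:
            new_ranks[dst] += share
    return new_ranks


def compensate(ranks, lost):
    """Hypothetical compensation: refill lost entries uniformly, then renormalize to sum 1."""
    n = len(ranks)
    repaired = [1.0 / n if v in lost else r for v, r in enumerate(ranks)]
    total = sum(repaired)
    return [r / total for r in repaired]


def run(out_links, n, fail_at=None, lost=(), tol=1e-10, max_iters=1000):
    """Iterate to a fixpoint; optionally simulate a failure at iteration `fail_at`."""
    ranks = [1.0 / n] * n
    for it in range(1, max_iters + 1):
        if it == fail_at:
            # Simulated failure: the entries in `lost` are gone. No checkpoint
            # is read; the state is repaired in place and iteration continues.
            ranks = compensate(ranks, set(lost))
        new_ranks = pagerank_step(ranks, out_links)
        delta = sum(abs(a - b) for a, b in zip(new_ranks, ranks))
        ranks = new_ranks
        if delta < tol:
            return ranks, it
    return ranks, max_iters


# Small illustrative graph; every vertex has at least one out-link.
GRAPH = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}

baseline, baseline_iters = run(GRAPH, 4)
recovered, recovered_iters = run(GRAPH, 4, fail_at=5, lost=[1, 3])
max_diff = max(abs(a - b) for a, b in zip(baseline, recovered))
```

In this sketch the failure-free run and the compensated run end at the same ranks up to the convergence tolerance, because the compensated state is still a valid starting point for the iteration. A rollback-based variant would instead have to write `ranks` to stable storage periodically and reload the last copy on failure.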
We consider the problem of bringing a distributed system to a consistent state after transient fail...
In traditional distributed simulation schemes, the entire simulation needs to be restarted if any of the...
Processor failures in post-petascale parallel computing platforms are common o...
Large-scale graph and machine learning analytics widely employ distributed iterative processing. Typ...
We propose a new algorithm for recovering asynchronously from failures in a distributed computation....
In a distributed system using rollback recovery, information saved on stable storage d...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
This thesis studies a forward recovery strategy using checkpointing and optimistic execution in para...
Several recovery techniques for parallel iterative methods are presented. First, the implementation ...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
Real-world graph processing applications often require combining the graph data with tabular data. M...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
... a process is logged on stable storage [5], and each process is occasionally checkpoint...
In this paper, we present a new protocol for optimistic rollback recovery in distributed systems. Th...